We are looking to understand the relationship between the major a student chooses and that student's self-assessment of the activities they engage in, their values, and their perceived competencies, in addition to their demographic characteristics (which could potentially serve as latent variables for cultural norms around STEM desirability).
Per the prompt:
At the end of this analysis, I wanted to produce a model that is reproducible and extendable to other datasets.
If we can predict the major of the students and understand the driving factors there are a few actions that can be taken from the model:
Our data for this analysis came from two different sources:
Note: the factbooks were compiled from the Wake Forest Office of Institutional Research site Link
Sample of the data that was extracted from the factbooks:
# Load data
# (packages assumed loaded in a setup chunk: tidyverse — readr, dplyr, forcats — plus knitr/kableExtra for tables)
suppressWarnings(suppressMessages(
  WFU_stu <- readr::read_csv("./Data/data_for_interviewees.csv") %>%
    mutate_if(is.numeric, as.factor) %>%
    mutate(GENDER = trimws(toupper(GENDER)),
           STEM = case_when(MAJOR == 6 ~ "YES-STEM", TRUE ~ "NO-STEM"),
           STEM = as.factor(STEM),
           FIRST_GEN = fct_collapse(FIRST_GEN, "YES" = c("2"), "NO" = c("1"))) %>%
    # Collapse the coded ethnicity values into aggregated, readable groups
    mutate(RACEgroup = fct_collapse(RACETHN,
                                    "Other" = c("1", "3", "5", "7"),
                                    "Asian" = c("2"),
                                    "LatinX" = c("4"),
                                    "White" = c("6"))) %>%
    # Recode the GPA and distance scales from numeric codes to labels
    mutate(COLLEGE_GPA = fct_collapse(COLLEGE_GPA,
                                      "D or lower" = c("1"), "C" = c("2"),
                                      "C+" = c("3"), "B-" = c("4"),
                                      "B" = c("5"), "B+" = c("6"),
                                      "A-" = c("7"), "A or A+" = c("8"))) %>%
    mutate(HS_GPA = fct_collapse(HS_GPA,
                                 "D or lower" = c("1"), "C" = c("2"),
                                 "C+" = c("3"), "B-" = c("4"),
                                 "B" = c("5"), "B+" = c("6"),
                                 "A-" = c("7"), "A or A+" = c("8"))) %>%
    mutate(DISTANCE = fct_collapse(DISTANCE,
                                   "5 or less" = c("1"), "6 - 10" = c("2"),
                                   "11 - 50" = c("3"), "51 - 100" = c("4"),
                                   "101 - 500" = c("5"), "Over 500" = c("6"))) %>%
    ungroup()
))
EthLook <- readxl::read_excel("./Data/DataFromFactBooks.xlsx", sheet = "LookUp")
suppressWarnings(suppressMessages(
FB <- readxl::read_excel("./Data/DataFromFactBooks.xlsx", sheet = "FC_Data") %>%
gather(3:9, key= "year",value="StuCount") %>%
mutate(year = as.numeric(year),
Gender= trimws(toupper(Gender))) %>%
mutate_if(is.numeric, replace_na, 0) %>%
left_join(., EthLook) %>%
mutate(ID = as_factor(ID),
RACEgroup = fct_collapse(ID, "Other" = c("-1","1", "3", "5", "7"),
"Asian" = c("2"),
"LatinX" = c("4"),
"White" = c("6")))
))
suppressWarnings(suppressMessages(
WFU_stu %>% group_by(YEAR) %>% summarise(CountStu_response = n()) %>%
left_join(.,
(FB %>% mutate(year = as_factor(year)) %>% group_by(year) %>% summarise(TotalStu = sum(StuCount, na.rm=TRUE))),
by= c("YEAR" = "year")) %>%
mutate(PercentResponse = CountStu_response/TotalStu) %>%
mutate(PercentResponse = scales::percent(PercentResponse)) %>%
kable(align='c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
))
| YEAR | CountStu_response | TotalStu | PercentResponse |
|---|---|---|---|
| 2012 | 513 | 1234 | 41.6% |
| 2013 | 594 | 1230 | 48.3% |
| 2014 | 395 | 1287 | 30.7% |
| 2015 | 498 | 1284 | 38.8% |
Note for future study: What happened in 2014, where the response rate was substantially lower than in the other years?
visdat::vis_dat(WFU_stu, sort_type = FALSE, palette = "cb_safe") +
coord_flip() +
theme_bw()
visdat::vis_miss(WFU_stu) + coord_flip() + theme_bw()
Compare survey response to student population using WFU fact book
The first thing we want to look at is whether the survey respondents are representative of the student population during the same period of time.
We decided to use a chi-square goodness-of-fit test (the chisq.test(x, p = ...) form), which is commonly used for comparing observed categorical counts against expected proportions. In our case, the null hypothesis is that the sample proportions match the population proportions. If the p-value of the chi-square test is very small, then we reject the null hypothesis and conclude that the two distributions differ.
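As a minimal illustration of the chisq.test(x, p = ...) form used below (toy counts, not our survey data):

```r
# Toy goodness-of-fit check: do observed counts match hypothesized proportions?
observed     <- c(60, 40)    # hypothetical respondent counts in two groups
population_p <- c(0.5, 0.5)  # those groups' shares in the population
res <- chisq.test(observed, p = population_p)
res$statistic  # sum((O - E)^2 / E) = (60-50)^2/50 + (40-50)^2/50 = 4
res$p.value    # ~0.046: at the 5% level we would reject equal proportions
```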
Looking into the comparison of gender and ethnicity
WFU_Resp <- WFU_stu %>%
group_by(GENDER, RACEgroup) %>% summarise(CountStu_Samp = n()) %>%
ungroup() %>%
mutate(Samp_Perc = CountStu_Samp/sum(CountStu_Samp))
TotalPop <- FB %>% filter(year <= 2015) %>% group_by(Gender, RACEgroup) %>%
summarise(CountStu_Pop = sum(StuCount, na.rm = TRUE)) %>%
ungroup() %>%
mutate(Pop_Perc = CountStu_Pop/sum(CountStu_Pop))
ComparSamp <- left_join(TotalPop, WFU_Resp,
by = c("Gender" = "GENDER", "RACEgroup" = "RACEgroup")) %>%
mutate_if(is.numeric, replace_na, 0) %>%
mutate(NonRep_wt = Pop_Perc/Samp_Perc)
ComparSamp %>%
round_df(2) %>%
kable(align='c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
| Gender | RACEgroup | CountStu_Pop | Pop_Perc | CountStu_Samp | Samp_Perc | NonRep_wt |
|---|---|---|---|---|---|---|
| FEMALE | Other | 247 | 0.05 | 115 | 0.06 | 0.85 |
| FEMALE | Asian | 300 | 0.06 | 145 | 0.07 | 0.82 |
| FEMALE | LatinX | 199 | 0.04 | 118 | 0.06 | 0.67 |
| FEMALE | White | 1903 | 0.38 | 821 | 0.41 | 0.92 |
| MALE | Other | 228 | 0.05 | 83 | 0.04 | 1.09 |
| MALE | Asian | 242 | 0.05 | 99 | 0.05 | 0.97 |
| MALE | LatinX | 155 | 0.03 | 61 | 0.03 | 1.01 |
| MALE | White | 1761 | 0.35 | 558 | 0.28 | 1.25 |
chisq.test(ComparSamp$CountStu_Samp, p = ComparSamp$Pop_Perc)
##
## Chi-squared test for given probabilities
##
## data: ComparSamp$CountStu_Samp
## X-squared = 62.656, df = 7, p-value = 4.442e-11
## A p-value near 0 means the sample distribution doesn't match the population
GEN_ETH_wt <- ComparSamp %>% select(Gender, RACEgroup, NonRep_wt)
Here, as with the two tests below, we find that the sample does not represent the student population.
Performing a chi-square test on the response counts of different genders
#chi square test for gender...
WFU_Resp <- WFU_stu %>%
group_by(GENDER) %>% summarise(CountStu_Samp = n()) %>%
ungroup() %>%
mutate(Samp_Perc = CountStu_Samp/sum(CountStu_Samp))
TotalPop <- FB %>% filter(year <= 2015) %>% group_by(Gender) %>%
summarise(CountStu_Pop = sum(StuCount, na.rm = TRUE)) %>%
ungroup() %>%
mutate(Pop_Perc = CountStu_Pop/sum(CountStu_Pop))
ComparSamp <- left_join(TotalPop, WFU_Resp,
by = c("Gender" = "GENDER")) %>%
mutate_if(is.numeric, replace_na, 0) %>%
mutate(NonRep_wt = Pop_Perc/Samp_Perc)
ComparSamp %>%
round_df(2) %>%
kable(align='c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
| Gender | CountStu_Pop | Pop_Perc | CountStu_Samp | Samp_Perc | NonRep_wt |
|---|---|---|---|---|---|
| FEMALE | 2649 | 0.53 | 1199 | 0.6 | 0.88 |
| MALE | 2386 | 0.47 | 801 | 0.4 | 1.18 |
chisq.test(ComparSamp$CountStu_Samp, p= ComparSamp$Pop_Perc)
##
## Chi-squared test for given probabilities
##
## data: ComparSamp$CountStu_Samp
## X-squared = 43.198, df = 1, p-value = 4.947e-11
## A p-value near 0 means the sample distribution doesn't match the population
With the low p-value from the chi-square test, we reject the null hypothesis and conclude that the survey responses do not match the gender proportions of the student population.
Looking at the ethnicity comparison
WFU_Resp <- WFU_stu %>%
group_by(RACEgroup) %>% summarise(CountStu_Samp = n()) %>%
ungroup() %>%
mutate(Samp_Perc = CountStu_Samp/sum(CountStu_Samp))
TotalPop <- FB %>% filter(year <= 2015) %>% group_by(RACEgroup) %>%
summarise(CountStu_Pop = sum(StuCount, na.rm = TRUE)) %>%
ungroup() %>%
mutate(Pop_Perc = CountStu_Pop/sum(CountStu_Pop))
ComparSamp <- left_join(TotalPop, WFU_Resp,
by = c("RACEgroup" = "RACEgroup")) %>%
mutate_if(is.numeric, replace_na, 0) %>%
mutate(NonRep_wt = Pop_Perc/Samp_Perc)
ComparSamp %>%
round_df(2) %>%
kable(align='c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
| RACEgroup | CountStu_Pop | Pop_Perc | CountStu_Samp | Samp_Perc | NonRep_wt |
|---|---|---|---|---|---|
| Other | 475 | 0.09 | 198 | 0.10 | 0.95 |
| Asian | 542 | 0.11 | 244 | 0.12 | 0.88 |
| LatinX | 354 | 0.07 | 179 | 0.09 | 0.79 |
| White | 3664 | 0.73 | 1379 | 0.69 | 1.06 |
chisq.test(ComparSamp$CountStu_Samp, p = ComparSamp$Pop_Perc)
##
## Chi-squared test for given probabilities
##
## data: ComparSamp$CountStu_Samp
## X-squared = 18.778, df = 3, p-value = 0.0003039
The chi-square test indicates that the survey responses do not match the ethnicity proportions of the student population.
By comparing the sample proportions to the population counts in the factbooks via chi-square tests, we find that the sample does not align with the population.
This misalignment is likely due to non-response; calculating the non-response weight (NonRep_wt = Pop_Perc / Samp_Perc) shows which gender/ethnicity groups are under- or over-represented.
We will use these weights when reviewing our descriptive stats.
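A hypothetical mini-example of how NonRep_wt behaves (made-up shares, not our survey numbers):

```r
# Non-response weight: population share divided by sample share.
# A weight above 1 up-weights an under-represented group; below 1 down-weights.
demo <- data.frame(
  group     = c("A", "B"),
  Pop_Perc  = c(0.50, 0.50),  # share of the student population
  Samp_Perc = c(0.40, 0.60)   # share of the survey respondents
)
demo$NonRep_wt <- demo$Pop_Perc / demo$Samp_Perc
demo$NonRep_wt  # 1.25 for the under-represented group A, ~0.83 for B
```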
GEN_ETH_wt %>%
arrange(desc(NonRep_wt)) %>%
round_df(., 3) %>%
kable(align='c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
| Gender | RACEgroup | NonRep_wt |
|---|---|---|
| MALE | White | 1.254 |
| MALE | Other | 1.091 |
| MALE | LatinX | 1.009 |
| MALE | Asian | 0.971 |
| FEMALE | White | 0.921 |
| FEMALE | Other | 0.853 |
| FEMALE | Asian | 0.822 |
| FEMALE | LatinX | 0.670 |
Below is a comparison of the count of students who fall within each category.
Demographic overview
plotDEMO <- function(col_name, TitleNM){
col_name <- enquo(col_name)
WFU_stu %>%
# select(2:9, STEM, RACEgroup) %>%
left_join(., GEN_ETH_wt, by = c("GENDER" = "Gender", "RACEgroup" = "RACEgroup")) %>%
group_by(STEM, !!col_name) %>%
summarise(WeightedResp = sum(NonRep_wt, na.rm = TRUE)) %>%
ungroup() %>%
group_by(STEM) %>%
mutate(TotalRes = sum(WeightedResp),
Percent = WeightedResp/TotalRes) %>%
ggplot(aes(x=!!col_name, y=Percent))+
geom_bar(stat = "identity", aes(fill=!!col_name))+
geom_text(aes(label = scales::percent(Percent),
y= Percent ), stat= "identity", vjust = -.5) + #, size=3
theme_bw()+
scale_fill_viridis(discrete=TRUE) +
facet_grid( ~ STEM) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks=seq(0, 1, by=0.2)) +
expand_limits(y=c(0,1)) +
ggtitle(paste0(TitleNM, " percentage by STEM group [Adjusted for non-response]"))
}
plotDEMO(GENDER, "Gender")
plotDEMO(RACEgroup, "Aggregated Ethnicity")
plotDEMO(YEAR, "Student Cohort")
plotDEMO(FIRST_GEN, "First Generation")
suppressWarnings(suppressMessages(plotDEMO(DISTANCE, "Distance from Home")))
suppressWarnings(suppressMessages(plotDEMO(HS_GPA, "HS GPA")))
plotDEMO(COLLEGE_GPA, "College GPA")
Responses to ACT Questions
These questions asked students about the frequency at which they performed certain activities in the past year, spanning activities both inside and outside the classroom environment.
Scale
1: Not at all
2: Occasionally
3: Frequently
Based on my personal bias (as a former STEM undergrad), I would expect the questions most likely to be positively associated with STEM selection to be:
ACT09: Look up scientific research articles and resources
ACT13: Write computer code
plotDEMO(ACT01, "ACT01") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT02, "ACT02") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT03, "ACT03") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT04, "ACT04") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT05, "ACT05") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT06, "ACT06") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT07, "ACT07") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT08, "ACT08") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT09, "ACT09") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT10, "ACT10") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT11, "ACT11") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT12, "ACT12") + xlab("Score") + theme(legend.position = "none")
plotDEMO(ACT13, "ACT13") + xlab("Score") + theme(legend.position = "none")
Responses to RATE Questions
These questions asked students about their perceived competence in a few areas. The questions fell into two general categories: 1) student perception of interpersonal interactions and 2) student perception of their own academic-related competencies (such as mathematical or writing abilities).
Scale
1: A major weakness
2: Somewhat weak
3: Average
4: Somewhat strong
5: A major strength
Based on my personal bias, I would expect the questions most likely to be positively associated with STEM selection to be:
RATE6: Mathematical ability
I would also expect the stereotype of poor writing ability (RATE7) to potentially result in a negative association with STEM.
plotDEMO(RATE1, "RATE1") + xlab("Score") + theme(legend.position = "none")
plotDEMO(RATE2, "RATE2") + xlab("Score") + theme(legend.position = "none")
plotDEMO(RATE3, "RATE3") + xlab("Score") + theme(legend.position = "none")
plotDEMO(RATE4, "RATE4") + xlab("Score") + theme(legend.position = "none")
plotDEMO(RATE5, "RATE5") + xlab("Score") + theme(legend.position = "none")
plotDEMO(RATE6, "RATE6") + xlab("Score") + theme(legend.position = "none")
plotDEMO(RATE7, "RATE7") + xlab("Score") + theme(legend.position = "none")
Responses to AGREE Questions
These questions asked students about how they would assess the value of certain future goals or their perception of what community they belong to.
Scale
1: Strongly disagree
2: Disagree
3: Slightly disagree
4: Slightly agree
5: Agree
6: Strongly agree
Based on my personal bias, I would expect the questions most likely to be positively associated with STEM selection to be:
AGREE_01: I have a strong sense of belonging to a community of scientists
AGREE_02: I think of myself as a scientist
AGREE_03: I feel like I belong in the field of science
AGREE_08: Making a theoretical contribution to science is important to me
plotDEMO(AGREE_01, "AGREE_01") + xlab("Score") + theme(legend.position = "none")
plotDEMO(AGREE_02, "AGREE_02") + xlab("Score") + theme(legend.position = "none")
plotDEMO(AGREE_03, "AGREE_03") + xlab("Score") + theme(legend.position = "none")
plotDEMO(AGREE_04, "AGREE_04") + xlab("Score") + theme(legend.position = "none")
plotDEMO(AGREE_05, "AGREE_05") + xlab("Score") + theme(legend.position = "none")
plotDEMO(AGREE_06, "AGREE_06") + xlab("Score") + theme(legend.position = "none")
plotDEMO(AGREE_07, "AGREE_07") + xlab("Score") + theme(legend.position = "none")
plotDEMO(AGREE_08, "AGREE_08") + xlab("Score") + theme(legend.position = "none")
plotDEMO(AGREE_09, "AGREE_09") + xlab("Score") + theme(legend.position = "none")
Based on the descriptive-statistics review of the variables, the missing data in "DISTANCE" and "HS_GPA" led us to exclude those two variables from the modeling dataset.
The goal of the model is to predict a binary categorical outcome (is the student likely to choose STEM as their major or not). These types of models are generally called "classification models," and two commonly used ones are logistic regressions and random forests.
In this algorithm, the probability of the outcome is modeled as the logistic function of a linear combination of the predictors; equivalently, the log-odds of the outcome are modeled as a linear function. This algorithm is part of the generalized linear model (GLM) family, of which linear regression is also a member.
Advantages: Logistic regression was designed for classification, and is very useful for understanding the influence of several independent variables on a single outcome variable.
Disadvantages: Works only when the predicted variable is binary, performs poorly when predictors are strongly correlated with one another, and assumes the data is free of missing values.
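A minimal sketch of that GLM formulation on simulated data (toy example; the variable names and true coefficients are illustrative only):

```r
# Logistic regression: the log-odds of the outcome are linear in the predictors
set.seed(1)
x <- rnorm(200)
p <- plogis(-1 + 2 * x)             # true model: log-odds = -1 + 2x
y <- rbinom(200, 1, p)
fit <- glm(y ~ x, family = binomial)
coef(fit)       # estimates should land near the true values (-1, 2)
exp(coef(fit))  # exponentiating a coefficient gives an odds ratio, used below
```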
A random forest classifier fits a number of decision trees on bootstrap sub-samples of the dataset and averages their votes to improve predictive accuracy and control over-fitting.
Advantages: Reduction in over-fitting and random forest classifier is more accurate than decision trees in most cases.
Disadvantages: Slow real time prediction, difficult to implement, and complex algorithm.
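The averaging idea can be sketched in base R with bagged decision stumps (a simplification: a real random forest also samples a subset of features at each split; the data and names here are illustrative only):

```r
# Bagging sketch: fit many weak learners on bootstrap resamples, then
# combine them by majority vote, which reduces the variance of any single fit.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- as.integer(x + rnorm(n, sd = 0.5) > 0)      # noisy binary outcome

stump_predict <- function(threshold, x) as.integer(x > threshold)

votes <- replicate(101, {
  idx  <- sample(n, replace = TRUE)              # bootstrap resample
  cand <- quantile(x[idx], probs = seq(0.1, 0.9, by = 0.1))
  errs <- sapply(cand, function(t) mean(stump_predict(t, x[idx]) != y[idx]))
  stump_predict(cand[which.min(errs)], x)        # this "tree" votes on all rows
})

ensemble <- as.integer(rowMeans(votes) > 0.5)    # majority vote across stumps
mean(ensemble == y)                              # ensemble accuracy on the toy data
```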
Training and Test data
When developing models, we often split the data into "training" and "test" datasets. This lets us measure model accuracy on data that was not used to fit the model.
In the case of this model we have utilized a 75/25 (training/test) split.
Model Performance
All of these outputs can be calculated using a confusion matrix, which allows us to see how many students are correctly assigned (true positives and true negatives) versus how many are not when we apply our model to the test dataset.
WFU_model <- WFU_stu %>%
# mutate(STEM = case_when(MAJOR == 6 ~"YES", TRUE ~ "NO")) %>%
# select(-MAJOR, -STUID, -YEAR, -DISTANCE, -HS_GPA) %>%
select(STEM, GENDER, RACEgroup, COLLEGE_GPA, FIRST_GEN,
ACT02, ACT08, ACT09, ACT12, ACT13,
RATE3, RATE5, RATE7,
AGREE_01, AGREE_02, AGREE_03, AGREE_08, AGREE_06, AGREE_07
) %>%
mutate(GENDER = as_factor(GENDER),
STEM = as_factor(STEM))
# Using tidymodels structure ---------
train_test_split <-
rsample::initial_split(
data = WFU_model,
prop = 0.75
)
# train_test_split
train_tbl <- train_test_split %>% training()
test_tbl <- train_test_split %>% testing()
recipe_prepped <- recipe_simple(dataset = train_tbl)
train_baked <- bake(recipe_prepped, new_data = train_tbl)
test_baked <- bake(recipe_prepped, new_data = test_tbl)
# logGLM <- glm(STEM ~ ., data=train_baked, family="binomial")
logistic_glm <-
logistic_reg(mode = "classification") %>%
set_engine("glm") %>%
fit(STEM ~ ., data = train_baked)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# logistic_glm
predictions_glm <- logistic_glm %>%
predict(new_data = test_baked) %>%
bind_cols(test_baked %>% select(STEM))
predictions_glm %>%
metrics(STEM, .pred_class) %>%
select(-.estimator) %>%
filter(.metric == "accuracy") %>%
round_df(., 3)
# Precision shows how sensitive the model is to false positives (i.e., predicting a student is STEM when they are not), whereas
# recall shows how sensitive the model is to false negatives (i.e., predicting a student is not STEM when they in fact are).
tibble(
"precision" =
precision(predictions_glm, STEM, .pred_class) %>%
select(.estimate),
"recall" =
recall(predictions_glm, STEM, .pred_class) %>%
select(.estimate)
) %>%
unnest() %>%
round_df(., 3) %>%
kable(align='c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
| precision | recall |
|---|---|
| 0.972 | 0.963 |
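As a quick reference for the definitions behind those two numbers, with toy confusion-matrix counts (not the model's actual counts):

```r
# precision = TP / (TP + FP); recall = TP / (TP + FN)
TP <- 90; FP <- 10; FN <- 5          # hypothetical counts
precision_toy <- TP / (TP + FP)      # 0.9
recall_toy    <- TP / (TP + FN)      # ~0.947
```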
predictions_glm %>%
mutate(Prediction = ifelse(.pred_class == "YES-STEM", 1, 0)) %>%
arrange(Prediction) %>%
mutate(ord = row_number()) %>%
ggplot(aes(x=ord, y=Prediction))+
geom_point(aes(color=STEM), size =2, alpha = .5)+
ylab("Predicted STEM")+
xlab("")+
scale_y_continuous(breaks=seq(0, 1, by=1)) +
expand_limits(y=c(0,1)) +
theme_bw()+
viridis::scale_color_viridis(discrete=TRUE)+
ggtitle("Logistic Regression Plot [Prediction vs Actual for Test data]")
Utilizing the odds ratio to interpret the elements of the logistic model
tidy(logistic_glm) %>%
mutate(pval = case_when(p.value < 0.05 ~ "<0.05", TRUE ~ ">0.05"),
Odds = exp(estimate)) %>%
arrange(desc(abs(Odds))) %>%
round_df(3) %>%
ggplot(aes(x=fct_reorder(term, (estimate)), y=estimate))+
geom_point(aes(color=pval))+
geom_segment(aes(x = term, y = 0, xend = term, yend = estimate), color = "grey50") +
geom_hline(yintercept = 0)+
coord_flip()+
theme_bw()+
viridis::scale_color_viridis(discrete = TRUE)+
xlab("")+
ggtitle("Coefficient Plot: Logistic Regression")
tidy(logistic_glm) %>%
mutate(pval = case_when(p.value < 0.05 ~ "<0.05", TRUE ~ ">0.05"),
Odds = exp(estimate)) %>%
filter(pval == "<0.05") %>%
round_df(2) %>%
rename(LogOdds = estimate) %>%
arrange(desc(abs(Odds))) %>%
kable(align='c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", full_width = F))
| term | LogOdds | std.error | statistic | p.value | pval | Odds |
|---|---|---|---|---|---|---|
| AGREE_016 | 33.68 | 4.98 | 6.76 | 0.00 | <0.05 | 4.235178e+14 |
| RACEgroupAsian | 26.43 | 3.42 | 7.72 | 0.00 | <0.05 | 2.996585e+11 |
| AGREE_086 | 23.97 | 5.07 | 4.73 | 0.00 | <0.05 | 2.583361e+10 |
| AGREE_015 | 22.38 | 3.00 | 7.45 | 0.00 | <0.05 | 5.250958e+09 |
| AGREE_026 | 20.64 | 3.25 | 6.35 | 0.00 | <0.05 | 9.191668e+08 |
| AGREE_035 | 17.73 | 2.42 | 7.33 | 0.00 | <0.05 | 5.011711e+07 |
| AGREE_025 | 17.48 | 2.38 | 7.34 | 0.00 | <0.05 | 3.889190e+07 |
| AGREE_014 | 17.36 | 2.31 | 7.51 | 0.00 | <0.05 | 3.477342e+07 |
| GENDERMALE | 15.18 | 1.93 | 7.86 | 0.00 | <0.05 | 3.932263e+06 |
| AGREE_085 | 14.97 | 2.04 | 7.35 | 0.00 | <0.05 | 3.167682e+06 |
| AGREE_084 | 12.55 | 1.72 | 7.29 | 0.00 | <0.05 | 2.826291e+05 |
| AGREE_034 | 12.17 | 1.78 | 6.83 | 0.00 | <0.05 | 1.920428e+05 |
| AGREE_013 | 12.04 | 1.62 | 7.44 | 0.00 | <0.05 | 1.698043e+05 |
| AGREE_024 | 11.81 | 1.68 | 7.04 | 0.00 | <0.05 | 1.350137e+05 |
| AGREE_033 | 9.49 | 1.39 | 6.81 | 0.00 | <0.05 | 1.328492e+04 |
| AGREE_083 | 7.94 | 1.20 | 6.63 | 0.00 | <0.05 | 2.803020e+03 |
| AGREE_023 | 7.64 | 1.16 | 6.57 | 0.00 | <0.05 | 2.070910e+03 |
| FIRST_GENYES | 7.22 | 1.11 | 6.53 | 0.00 | <0.05 | 1.368710e+03 |
| RACEgroupWhite | 6.44 | 1.15 | 5.61 | 0.00 | <0.05 | 6.263100e+02 |
| AGREE_082 | 5.13 | 0.93 | 5.50 | 0.00 | <0.05 | 1.683500e+02 |
| AGREE_012 | 4.26 | 0.80 | 5.32 | 0.00 | <0.05 | 7.078000e+01 |
| AGREE_032 | 4.26 | 0.82 | 5.17 | 0.00 | <0.05 | 7.067000e+01 |
| AGREE_022 | 3.52 | 0.79 | 4.48 | 0.00 | <0.05 | 3.381000e+01 |
| ACT123 | -1.71 | 0.83 | -2.06 | 0.04 | <0.05 | 1.800000e-01 |
| RATE34 | -2.02 | 0.94 | -2.14 | 0.03 | <0.05 | 1.300000e-01 |
| AGREE_072 | -2.07 | 0.69 | -3.01 | 0.00 | <0.05 | 1.300000e-01 |
| (Intercept) | -42.89 | 18.37 | -2.34 | 0.02 | <0.05 | 0.000000e+00 |
| RATE55 | -6.72 | 1.76 | -3.83 | 0.00 | <0.05 | 0.000000e+00 |
| AGREE_062 | -7.54 | 1.09 | -6.91 | 0.00 | <0.05 | 0.000000e+00 |
| AGREE_063 | -20.07 | 2.59 | -7.75 | 0.00 | <0.05 | 0.000000e+00 |
| AGREE_064 | -38.18 | 5.00 | -7.63 | 0.00 | <0.05 | 0.000000e+00 |
| AGREE_065 | -55.86 | 7.15 | -7.81 | 0.00 | <0.05 | 0.000000e+00 |
| AGREE_073 | -7.74 | 1.20 | -6.47 | 0.00 | <0.05 | 0.000000e+00 |
| AGREE_074 | -13.74 | 1.93 | -7.10 | 0.00 | <0.05 | 0.000000e+00 |
| AGREE_075 | -22.74 | 3.34 | -6.82 | 0.00 | <0.05 | 0.000000e+00 |
| AGREE_076 | -32.44 | 4.52 | -7.18 | 0.00 | <0.05 | 0.000000e+00 |
# Interpreting an odds ratio: a log-odds coefficient of 0.98 would correspond to exp(0.98) ≈ 2.66,
# i.e., the odds of choosing STEM are about 2.66 times higher for that factor level, holding other predictors constant.
Utilizing a random forest model to classify whether a student will graduate with a STEM major or not.
# WFU_RF <- WFU_stu %>%
# # mutate(STEM = case_when(MAJOR == 6 ~"YES", TRUE ~ "NO")) %>%
# select(-MAJOR, -STUID, -YEAR, -DISTANCE, -HS_GPA, -RACETHN, -DISTANCEval) %>%
# # select(STEM, GENDER, RACETHN, COLLEGE_GPA, FIRST_GEN, ACT09, ACT12, ACT13, RATE5, RATE7, AGREE_01, AGREE_02, AGREE_03, AGREE_08) %>%
# mutate(GENDER = as_factor(GENDER),
# STEM = as_factor(STEM))
#
# rf_train_test_split <-
# rsample::initial_split(
# data = WFU_RF,
# prop = 0.80
# )
#
# # train_test_split
#
# rf_train_tbl <- rf_train_test_split %>% training()
# rf_test_tbl <- rf_train_test_split %>% testing()
#
# rf_recipe_prepped <- recipe_simple(dataset = rf_train_tbl)
#
# rf_train_baked <- bake(recipe_prepped, new_data = rf_train_tbl)
# rf_test_baked <- bake(recipe_prepped, new_data = rf_test_tbl)
rf_build <-
rand_forest(trees = 1000, mtry = 6, mode = "classification") %>%
set_engine("randomForest", importance=T, localImp = T) %>%
fit(STEM ~ ., data = train_baked)
predictions_rf <- rf_build %>%
predict(new_data = test_baked) %>%
bind_cols(test_baked %>% select(STEM))
predictions_rf %>%
metrics(STEM, .pred_class) %>%
select(-.estimator) %>%
filter(.metric == "accuracy") %>%
round_df(., 3) %>%
kable(align='c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
| .metric | .estimate |
|---|---|
| accuracy | 0.856 |
impt_frame <- measure_importance(rf_build$fit)
## Warning: Factor `split var` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `split var` contains implicit NA, consider using
## `forcats::fct_explicit_na`
plot_multi_way_importance(impt_frame, no_of_labels = 10)
plot_multi_way_importance(impt_frame, x_measure = "accuracy_decrease", y_measure = "gini_decrease", size_measure = "p_value", no_of_labels = 10)
## Warning: Using alpha for a discrete variable is not advised.
#variable depth
md_frame <- min_depth_distribution(rf_build$fit)
## Warning: Factor `split var` contains implicit NA, consider using
## `forcats::fct_explicit_na`
plot_min_depth_distribution(md_frame, mean_sample = "top_trees") # default mean_sample arg
The confusion matrix allows us to show (based on the test data) how often we are correctly predicting each category.
When we compare the logistic regression to the random forest, we see that the logistic regression model performs better. The logistic regression model also has the added benefit of being significantly easier to interpret than the random forest.
# Confusion matrix for Logistic
predictions_glm %>%
conf_mat(STEM, .pred_class) %>%
pluck(1) %>%
as_tibble() %>%
ggplot(aes(Prediction, Truth, alpha = n)) +
geom_tile(show.legend = FALSE) +
geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)+
theme_bw()+
ggtitle("Confusion Plot: Logistic Model")
# Confusion matrix for RF
predictions_rf %>%
conf_mat(STEM, .pred_class) %>%
pluck(1) %>%
as_tibble() %>%
ggplot(aes(Prediction, Truth, alpha = n)) +
geom_tile(show.legend = FALSE) +
geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)+
theme_bw()+
ggtitle("Confusion Plot: Random Forest Model")
So what are the student characteristics that make a student more likely to choose STEM as their major?
If we want to optimize the current model:
If we want to correct for the underlying reasons behind the predictive elements:
Research Items
A work by Phil Walker
philwalker2012@gmail.com